excerpted from -
Science News, feb 26, 1991, vol 139 no. 7
Finding Fault
the formidable task of eradicating software bugs
typed by: Horatio
(contact at DRU, 8067944362)
this is about half of the article. the rest went on to describe the
uninteresting problems of attempting to design software safeguards for
nuclear power plants
-----------------------------------------------------------------------------
[several paragraphs of intro material]
the software glitch that disrupted at&t's long-distance telephone service
for nine hours in january 1990 dramatically demonstrates what can go wrong
even in the most reliable and scrupulously tested systems. of the roughly
100 million telephone calls placed with at&t during that period, only about
half got through. the breakdown cost the company more than $60 million in
lost revenues and caused considerable inconvenience and irritation for
telephone-dependent customers.
the trouble began at a "switch" - one of 114 interconnected, computer-
operated electronic switching systems scattered across the united states.
these sophisticated systems, each a maze of electronic equipment housed in
a large room, form the backbone of at&t's long-distance telephone network.
when a local exchange delivers a telephone call to the network, it
arrives at one of these switching centers, which can handle up to 700,000
calls an hour. the switch immediately springs into action. it scans a
list of 14 different routes it can use to complete the call, and at the
same time hands off the telephone number to a parallel signalling network,
invisible to any caller. this private data network allows computers to
scout the possible routes and to determine whether the switch at the other
end can deliver the call to the local company it serves.
if the answer is no, the call is stopped at the original switch to keep
it from tying up a line, and the caller gets a busy signal. if the answer
is yes, a signaling-network computer makes a reservation at the destination
switch and orders the original switch to pass along the waiting call - after
that switch makes a final check to ensure that the chosen line is
functioning properly. the whole process of passing a call down the network
takes 4 to 6 seconds. because the switches must keep in constant touch
with the signaling network and its computers, each switch has a computer
program that handles all the necessary communications between the switch
and the signaling network.
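[the article prints no code, but a rough c sketch of the call-setup sequence
just described might look something like the toy program below. every name,
helper and canned answer in it is invented for illustration - it is not
at&t's switching software, only a model of the steps the article walks
through: ask whether the destination switch can deliver the call, scan the
list of 14 routes, reserve capacity at the far end, and make a final check
on the chosen line before passing the call along.]

#include <stdbool.h>
#include <stdio.h>

#define NUM_ROUTES 14   /* each switch keeps a list of 14 candidate routes */

/* placeholder stand-ins for queries that, in the real network, travel over
 * the private signaling network; here they just return canned answers.    */
static bool destination_can_deliver(const char *number) { (void)number; return true; }
static bool reserve_at_destination(const char *number)  { (void)number; return true; }
static bool route_is_free(int route)                    { return route >= 3; }
static bool line_check_ok(int route)                    { (void)route;  return true; }

/* returns the chosen route, or -1 if the caller should hear a busy signal */
static int place_call(const char *number)
{
    /* if the far-end switch cannot hand the call to its local company,
     * stop at the originating switch so no long-distance line is tied up */
    if (!destination_can_deliver(number))
        return -1;

    for (int route = 0; route < NUM_ROUTES; route++) {
        if (!route_is_free(route))
            continue;
        /* signaling computer reserves capacity at the destination, then
         * the originating switch makes a final check on the chosen line */
        if (reserve_at_destination(number) && line_check_ok(route))
            return route;
    }
    return -1;
}

int main(void)
{
    int route = place_call("212-555-0199");
    if (route < 0)
        printf("busy signal\n");
    else
        printf("call completed over route %d\n", route);
    return 0;
}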
at&t's first indication that something might be amiss appeared on a giant
video display at the company's network control center in bedminster, nj.
at 2:25 pm on monday, jan 15, 1990, network managers saw an alarming
increase in the number of red warning signals appearing on many of the 75
video screens showing the status of various parts of at&t's world-wide
network. the warnings signaled a serious collapse in the network's ability
to complete calls within the united states.
to bring the network back up to speed, at&t's engineers first tried a
number of standard procedures that had worked in the past. this time, the
methods failed. the engineers realized they had a problem never seen
before. nonetheless, within a few hours, they managed to stabilize the
network by temporarily cutting back on the number of messages moving
through the signaling network. they cleared the last defective link at
11:30 that night.
meanwhile, a team of more than 100 telephone technicians tried frantically
to track down the fault. by monitoring patterns in the constant stream of
messages reaching the control center from the switches and the signaling
network, they searched for clues to the cause of the network's surprising
behavior. because the problem involved the signalling network and seemed
to bounce from one switch to another, they zeroed in on the software that
permitted each switch to communicate with the signalling network computers.
the day after the slowdown, at&t personnel removed the apparently faulty
software from each switch, temporarily replacing it with an earlier
version of the communications program. a close examination of the flawed
software turned up a single error in one line of the program. just one
month earlier, network technicians had changed the software to speed the
processing of certain messages, and the change had inadvertently introduced
a flaw into the system.
from that finding, at&t could reconstruct what had happened.
the incident started, the company discovered, when a switching center in
new york city, in the course of checking itself, found it was nearing its
limits and needed to reset itself - a routine maintenance operation that
takes only 4 to 6 seconds. the new york switch sent a message via the
signalling network, notifying the other 113 switches that it was
temporarily dropping out of the telephone network and would take no more
telephone calls until further notice. when it was ready again, the new
york switch signaled to all the other switches that it was open for
business by starting to distribute calls that had piled up during the brief
interval when it was out of service.
one switch in another part of the country received its first message
that a call from new york was on its way, and started to update its
information on the status of the new york switch. but in the midst of that
operation, it received a second message from the new york switch, which
arrived less than a hundredth of a second after the first.
here's where the fatal software flaw surfaced. because the receiving
switch's communication software was not yet finished with the information
from the first call, it had to shunt the second message aside. because
of a programming error, the switch's processor mistakenly dumped the data
from the second message into a section of its memory already storing
information crucial for the functioning of the communications link. the
switch detected the damage and promptly activated a backup link, allowing
time for the original communication link to reset itself.
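[again, the article prints no code. the toy c program below is one made-up
way the kind of one-line flaw described above could look: a second status
message arrives while the first is still being handled, gets set aside
without its length being clamped, and spills onto a flag the communications
link depends on. layout, names and lengths are all invented for
illustration; the actual at&t defect is not shown in the article.]

#include <stdio.h>
#include <string.h>

#define MSG_LEN 16

/* the switch's memory modeled as one flat region: a slot for a deferred
 * message, immediately followed by a flag the communications link needs */
static unsigned char memory[MSG_LEN + 1];
#define LINK_OK_FLAG (memory[MSG_LEN])   /* 1 = link healthy */

static int busy;                         /* nonzero while a message is
                                            still being processed       */

static void handle_status_message(const unsigned char *msg, size_t len)
{
    if (!busy) {
        busy = 1;
        /* ... process the first message normally ... */
        busy = 0;
        return;
    }
    /* a second message arrived less than 1/100 second after the first,
     * so it must be set aside. the one-line flaw: the copy length is not
     * clamped to MSG_LEN, so a long message spills onto the flag that
     * tells the switch its communications link is healthy.             */
    memcpy(memory, msg, len);    /* should be: len > MSG_LEN ? MSG_LEN : len */

    if (!LINK_OK_FLAG) {
        /* damage detected: fall back to the backup link, as in the article */
        puts("link state corrupted - switching to backup link");
    }
}

int main(void)
{
    LINK_OK_FLAG = 1;                         /* link starts out healthy      */
    unsigned char burst[MSG_LEN + 1] = {0};   /* a message one byte too long  */
    busy = 1;                                 /* first message still in flight */
    handle_status_message(burst, sizeof burst);
    return 0;
}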
unfortunately, another pair of closely spaced calls put the second
processor out of commission, and the entire switch shut down temporarily.
these delays caused further telephone-call backups, and because all the
switches had the same software containing the same error, the effect
cascaded throughout the system. the instability in the network persisted
because of the random nature of the failures and the constant pressure of
the traffic load within the network.
although the software changes introduced the month before had been
rigorously tested in the laboratory, no one anticipated the precise
combination and pace of events that would lead to the network's
near-collapse.
in their public report, members of the team from at&t bell laboratories
who investigated the incident state: "we believe the software design,
development and test processes we used are based on solid, quality
foundations. all future releases of software will continue to be rigorously
tested. we will use the experience we've gained through the problem to
further improve our procedures."
in spite of such optimism, however, "there is still a long way to go in
attaining dependable distributed control," warns peter g. neumann, a
computer scientist with sri international in menlo park, california.
"similar problems can be expected to recur, even when the greatest pains
are taken to avoid them."
[more uninteresting nuclear reactor stuff]
-----------------------------------------------------------------------------
EOF
thanks go out to everybody in the hack/phreak world who is/was kind enough
to type up a few bytes of information for the education/amusement of all,
particularly: cDc, toxic shock, phrack, phun, LOD, NIA, and CUD.